Introduction

Either directly or indirectly, the lexicon for a nat- 
ural language specifies complementation frames or 
valences for open-class words such as verbs and 
nouns. Constructing a lexicon of complementation 
frames for large vocabularies constitutes a chal- 
lenge of scale, with the further complication that 
frame usage, like vocabulary, varies with genre and 
undergoes ongoing innovation in a living language. 
This paper addresses this problem by means of a 
learning technique based on probabilistic lexical- 
ized context free grammars and the expectation- 
maximization (EM) algorithm. Given a hand- 
written grammar and a text corpus, frequencies of 
a head word accompanied by a frame are estimated 
using the inside-outside algorithm, and such fre- 
quencies are used to compute probability param- 
eters characterizing subcategorization. The pro- 
cedure can be iterated for improved models. We 
show that the scheme is practical for large vocab- 
ularies and accurate enough to capture differences 
in usage, such as those characteristic of different 
domains. 

A grammar and formalism 

The core of the grammar is an X-bar grammar (Jackendoff
[1977]) of phrases including noun phrases, prepositional
phrases, and verbal clusters. A representative verbal
structure is given on the left in Figure 1. The symbol VFC
is read "finite verb chunk"; similarly we work with noun
chunks (NC), prepositional chunks (PC), and so forth. Our
use of the chunk concept follows Abney [1991], Abney
[1995]. Categories are interpretable in terms of a feature
decomposition, but are treated as atomic in the formalism.
We depart from a standard context-free formalism in that
heads are marked on the right hand sides of rules, using a
prime (').

The grammar includes complementation rules for verbs,
nouns, and adjectives. Complements are attached at a level
above the chunk, which we call the phrasal level. For
instance, the category VFP is expanded as a finite verb
chunk VFC and a sequence of complements. This is
illustrated on the right in Figure 1, where the VFC headed
by decided takes a VTOP complement, the VTOC headed by
emphasize takes an NP complement, and so forth.

Finally, the least standard part of the grammar 
is a large set of state or n-gram rules which form a 
parse without constructing a standard clause-level 
analysis. Instead, phrasal categories are strung to- 
gether with context-free rules modelling a finite 
state machine, where the states are categories con- 
sisting of an ordered pair of phrasal categories. 
This results in right-branching structures, as illustrated
in Figure 2. Note that the entire tree on the right in
Figure 1 could be substituted for the finite verb phrase
VFP in the tree on the left in Figure 2. The state rules
allow almost all the sentences (about 97%) in the corpus to
be parsed,
at the price of not assigning linguistically realistic 
higher-level structure. 

We now define headed context-free grammars in 
the sense employed here. 

Definition. A headed context free grammar is a tuple
$\langle N, T, W, \mathcal{L}, \mathcal{R}, s \rangle$,
where: (i) $N$ and $T$ are disjoint sets, interpreted as
the non-terminal and terminal categories respectively,
(ii) $W$ is a set, interpreted as the set of words,
(iii) $\mathcal{L}$ is a relation between $W$ and $T$,
indicating the possible terminal categories (parts of
speech) for a given word, (iv) the set of headed
productions $\mathcal{R}$ is a finite subset of
$N \times N^* \times (N \cup T) \times N^*$, such that each
non-terminal occurs as the left hand side of some rule and
each terminal occurs on the right hand side of some rule,
(v) $s \in N$, with the interpretation of a start symbol.

Figure 1: Illustrations of a finite verb chunk and complementation.

We typically use $n$ as a variable for mother categories,
$n'$ for head daughter categories, and $\alpha$ and $\beta$
for the category sequences flanking the head on the right
hand side, so that $\langle n, \alpha, n', \beta \rangle$
represents a rule. $x$ is used as a variable for non-head
categories.

A category $n$ in $N$ is a projection of a category $n'$ in
$N \cup T$ if there is some rule of the form
$\langle n, \alpha, n', \beta \rangle$. The set of
lexicalized nonterminals $\mathcal{N} \subseteq W \times N$
is the composition of $\mathcal{L}$ with the transitive
closure of the projection relation. We have
$\langle w, n \rangle \in \mathcal{N}$ if the word $w$ can
be the lexical head of the nonterminal category $n$ (in a
complete or incomplete tree).
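The construction of the lexicalized nonterminals can be sketched in a few lines of Python. This is a minimal illustration, assuming the lexicon relation is given as a set of (word, terminal) pairs and the headed productions as (mother, left sequence, head daughter, right sequence) tuples; the toy rules and the tag names VVD and NN are hypothetical and are not taken from the grammar used in the paper.

```python
def lexicalized_nonterminals(lexicon, rules):
    """Compose the lexicon relation with the transitive closure of projection.

    lexicon: set of (word, terminal category) pairs.
    rules:   set of (mother, left, head, right) headed productions.
    Returns the set of (word, nonterminal) pairs.
    """
    # Projection relation, stored as head daughter -> set of mothers.
    projects_to = {}
    for mother, _left, head, _right in rules:
        projects_to.setdefault(head, set()).add(mother)

    result = set()
    for word, tag in lexicon:
        # Follow one or more projection steps upward from the terminal.
        seen, frontier = set(), [tag]
        while frontier:
            cat = frontier.pop()
            for mother in projects_to.get(cat, ()):
                if mother not in seen:
                    seen.add(mother)
                    frontier.append(mother)
        result.update((word, n) for n in seen)
    return result


# Hypothetical toy grammar: heads are the third component of each rule.
RULES = {
    ("VFP", (), "VFC", ("NP",)),
    ("VFC", (), "VVD", ()),
    ("NP", (), "NC", ()),
    ("NC", (), "NN", ()),
}
LEXICON = {("asked", "VVD"), ("question", "NN")}

LEXICALIZED = lexicalized_nonterminals(LEXICON, RULES)
```

With this toy input, asked can head VFC and VFP, and question can head NC and NP.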

Lexicalization and the probability model 

This section defines a parameterized family of probability
distributions over the trees licensed by a head-lexicalized
CFG. The main ideas on the parameterization of a
lexicalized context free grammar which are employed here
derive from Charniak [1995]; see also the remarks on
lexicalization in Charniak [1993, section 8.4].

The head marking on rules is used to project lexical items
up a chain of categories. In the transitive verb phrase on
the right in Figure 2, question is projected to the NP
level, and asked is projected to the VFP level. In this
tree, the non-terminal nodes are lexicalized non-terminals,
while the terminal nodes are members of $\mathcal{L}$. The
point of projecting head words is to make information which
probabilistically conditions rules and lexical choices
available at the relevant level. At the top level in this
example, the head asked is used to condition the choice of
the phrase structure rule VFP → VFC' NP as well as the
choice of question, the head of the object.

We now define events which characterize choices 
of rules and of lexical heads. 

Definition. Given a grammar
$G = \langle N, T, W, \mathcal{L}, \mathcal{R}, s \rangle$
with lexicalized non-terminals $\mathcal{N}$, the set of
rule events $E_R(G)$ is the set of tuples
$\langle w, n, \alpha, n', \beta \rangle$ such that
$\langle w, n \rangle$ is an element of $\mathcal{N}$ and
$\langle n, \alpha, n', \beta \rangle$ is an element of
$\mathcal{R}$. The set of lexical choice events $E_L(G)$ is
the set of tuples $\langle w, n, x, v \rangle$ such that
(i) $\langle w, n \rangle$ and $\langle v, x \rangle$ are
elements of $\mathcal{N}$;^1 (ii) in some rule of the form
$\langle n, \alpha, n', \beta \rangle$, $x$ is an element
of one or both of the category sequences $\alpha$ and
$\beta$; and [...]

^1 In the events, conditioning factors are ordered in the
way they are dropped off in the smoothing procedure
described below. In a lexical event
$\langle w, n, x, v \rangle$, the choice of the word $v$ is
conditioned on the parent lexical head $w$, the parent
category $n$, and the child category $x$. In the first
smoothing distribution, the first conditioning factor, i.e.
the parent head $w$, is dropped.

By virtue of the length of tuples, $E_R(G)$ and $E_L(G)$
are disjoint, and the union $E(G)$ can be formed without
confusing lexical with rule events.

A head-lexicalized PCFG is represented as a function
mapping events to real numbers.

Definition. Let $G$ be a headed context free grammar. A
head-lexicalized probabilistic context free grammar with
signature $G$ is a function $p$ with domain $E(G)$ and
range $[0, 1]$ satisfying the conditions: (i) Fixing any
lexicalized non-terminal $\langle w, n \rangle$,
$\sum_{\alpha, n', \beta} p_{w,n,\alpha,n',\beta} = 1$;
(ii) Fixing any lexicalized non-terminal
$\langle w, n \rangle$ and possible non-head daughter $x$,
$\sum_{v} p_{w,n,x,v} = 1$. Here the value of the function
$p$ on a rule event is written as
$p_{w,n,\alpha,n',\beta}$, and on a lexical event as
$p_{w,n,x,v}$.

To assign probability weights to trees, we use a
tree-licensing and labelling interpretation of the grammar;
a node in a tree analysis is labeled with the event
corresponding to the rule used to expand the node, and the
list of lexical events for the non-head daughters of the
node. Where $\tau$ is a labeled tree licensed by $G$, we
define $e(\tau) : E(G) \to \mathbb{N}$ to be a function
counting occurrences of events as labels in $\tau$.
Algebraically, we think of $e(\tau)$ as a monomial in the
variables $E(G)$; the exponent of a given variable (or
event) $z$ is the number of occurrences of $z$ in $\tau$.
We denote the evaluation of a polynomial or monomial
$\phi$ in the variables $E(G)$ by subscripting: $\phi_p$ is
the value of $\phi$ at the vector of reals $p$. Relative to
a parameter setting $p$, $[e(\tau)]_p$ is interpreted as
the probabilistic weight of the labeled tree $\tau$.^2

^2 As with ordinary PCFGs, depending on the parameters, the
construction may or may not define a probability measure on
the set of finite trees licensed by $G$. For the general
case, infinite trees can be included in the sample space.
This requires an extension in the definition of the measure
but does not affect the probabilities of finite trees.

These notions are exemplified in Figure 3, which is a
phrase structure tree for the N1 (read: N-bar) big big
problem in a grammar where N1 is the sentence category.
Each non-terminal is labeled with a phrase structure rule,
and with lexical choice events for non-head daughters. In
this case, the only non-head daughters are the two A1's
headed by big. (problem, N1, A1, big) is a lexical choice
event where big is selected as the head of an A1 with
parent category N1, and parent head problem. An event
monomial corresponding to the event tree is obtained as the
symbolic product of the events labeling the tree.
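As a concrete illustration, the following Python sketch builds an event monomial as a multiset of events and evaluates it at a parameter setting. The events are loosely modeled on the big big problem example, but the exact rules, the omission of the START-CAT event, and all probability values are assumptions made for the illustration, not the contents of Figure 3.

```python
from collections import Counter
from math import prod

# Rule events:    (head word, mother, left sequence, head daughter, right sequence)
# Lexical events: (parent head, parent category, child category, child head)
# Hypothetical labels for an analysis of "big big problem".
TREE_EVENTS = [
    ("problem", "N1", ("A1",), "N1", ()),  # outer N1 expands to A1 N1'
    ("problem", "N1", "A1", "big"),        # choose big as head of the A1
    ("big", "A1", (), "A", ()),
    ("problem", "N1", ("A1",), "N1", ()),  # inner N1 expands to A1 N1'
    ("problem", "N1", "A1", "big"),
    ("big", "A1", (), "A", ()),
    ("problem", "N1", (), "N", ()),        # innermost N1 expands to N'
]


def event_monomial(events):
    """Map each event to its exponent, i.e. its number of occurrences."""
    return Counter(events)


def evaluate(monomial, p):
    """Value of the monomial at p: product of p[z] ** exponent."""
    return prod(p[z] ** k for z, k in monomial.items())


# Hypothetical parameter values (rule probabilities for (problem, N1) sum to 1).
P = {
    ("problem", "N1", ("A1",), "N1", ()): 0.3,
    ("problem", "N1", (), "N", ()): 0.7,
    ("big", "A1", (), "A", ()): 1.0,
    ("problem", "N1", "A1", "big"): 0.1,
}

MONOMIAL = event_monomial(TREE_EVENTS)
WEIGHT = evaluate(MONOMIAL, P)  # 0.3**2 * 0.1**2 * 1.0**2 * 0.7
```

Representing the monomial as a counter keeps the symbolic object (event exponents) separate from its numeric value under a particular parameter setting.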

Parameter Estimation 

Given a grammar G, the inductive problem is to es- 
timate a head-lexicalized PCFG with signature G. 
We work with the standard method for estimating 
PCFGs, based on the Expectation-Maximization 
framework (Baum & Sell [1968]; Dempster, Laird 
& Rubin [1977]). 

Above, we defined the event polynomial $e(\tau)$ for an
event tree $\tau$ licensed by $G$. The event polynomial for
a sentence $\sigma$ is the sum of the event polynomials for
the event trees with yield $\sigma$. Where corpus $C$ is a
sequence of sentences, the corpus event polynomial $e(C)$
is the (polynomial) product of the event polynomials for
the sentences in $C$. In these terms, maximum likelihood
estimation selects a parameter setting $p$ such that the
value $[e(C)]_p$ of the corpus polynomial is maximized;
this corresponds to selecting a parameter setting which
maximizes the probability of the corpus.
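Written out, the three definitions in this paragraph take the following form. The names $\mathrm{yield}(\tau)$ and $\hat{p}$ are introduced here only for the illustration and do not appear in the text.

```latex
\begin{align*}
e(\sigma) &= \sum_{\tau \,:\, \mathrm{yield}(\tau) = \sigma} e(\tau), \\
e(C) &= \prod_{\sigma \in C} e(\sigma), \\
\hat{p} &= \operatorname*{arg\,max}_{p} \; [e(C)]_p .
\end{align*}
```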

The E step of the EM algorithm computes an expected event
count function which can be defined in terms of the corpus
polynomial. In the estimation of PCFGs using the
inside-outside algorithm, event counts are computed
iteratively, sentence by sentence. The computation uses a
packed parse forest, a compact and-or graph which
represents a set of trees and the sentence event
polynomial, and which allows efficient computation of
expected event counts. Somewhat more formally, we use the
inside-outside algorithm (Baker [1979]) to compute a count
function $c(\sigma, p) : E(G) \to \mathbb{R}$, where $z$
ranges over events in the joint rule and lexical event
space $E(G)$, defined earlier. The value
$c(\sigma, p)(z) = E_p(z \mid \sigma)$ has the
probabilistic interpretation of the expected number of
occurrences of the event $z$ in the set of trees with yield
$\sigma$.
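The quantity being computed can be illustrated by brute force. The sketch below enumerates the analyses of one sentence explicitly and weights each tree's event counts by its posterior probability; it is not the inside-outside algorithm, which obtains the same expectations from the packed forest without enumerating trees. The event names and probabilities are hypothetical.

```python
from collections import Counter
from math import prod


def tree_weight(events, p):
    """Probabilistic weight of one tree: product of its event probabilities."""
    return prod(p[z] for z in events)


def expected_counts(trees, p):
    """Expected number of occurrences of each event given the sentence.

    trees: list of event lists, one per analysis with the sentence as yield.
    Assumes at least one analysis has non-zero weight.
    """
    weights = [tree_weight(t, p) for t in trees]
    total = sum(weights)  # value of the sentence polynomial at p
    counts = Counter()
    for events, w in zip(trees, weights):
        for z in events:
            counts[z] += w / total  # posterior probability of this tree
    return counts


# Two hypothetical analyses sharing event "r1" and differing in one event.
TREES = [["r1", "r2"], ["r1", "r3"]]
P = {"r1": 0.5, "r2": 0.2, "r3": 0.6}
COUNTS = expected_counts(TREES, P)
```

Here the two trees have weights 0.1 and 0.3, so the shared event receives an expected count of 1 and the two competing events receive 0.25 and 0.75.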

Figure 2: Left: finite-state structure; Right: Lexicalization.

Figure 3: On the left, an event tree. On the right, the
corresponding lexicalized tree. On the bottom, the event
monomial obtained as a symbolic product of the event
labels. The lexical choice event involving START-CAT
chooses the head of the sentence, in this case problem.

Given a parameter setting $p$, event counts are computed
and summed over the sentences in the corpus. In the
algorithm of Baum and Sell, new parameter values would be
defined as relative frequencies of event counts, i.e.
maximum-likelihood estimation based on hidden data in the
EM framework. We use instead a modified M step involving a
smoothing scheme in order to deal with the size of the
parameter space and the resulting problems that (i) counts
are zero for the majority of events, and (ii) the parameter
space is too large to be represented directly in computer
memory. Lexicalized rules are smoothed against
non-lexicalized rules in a standard back-off scheme (Katz
[1980]). The smoothed probability is defined as a weighted
sum of the maximum-likelihood estimates for the lexicalized
and unlexicalized rule probabilities. The smoothing weight
is allowed to vary through five discrete values as a
function of the frequency of the word-category pair. The
parameters give greater weight to the lexicalized
distribution when enough data is present to justify it. The
smoothing parameters are set using the EM algorithm on
reserved data.
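The weighted sum described above can be sketched as follows. The frequency thresholds and the five weight values are hypothetical placeholders; in the paper the weights are set by EM on reserved data, which this sketch does not attempt.

```python
import bisect

# Hypothetical frequency thresholds separating five buckets, and one
# hypothetical smoothing weight per bucket (more data, more lexicalized weight).
BUCKET_EDGES = [5, 20, 100, 500]
BUCKET_WEIGHTS = [0.1, 0.3, 0.5, 0.7, 0.9]


def smoothing_weight(pair_frequency):
    """Pick one of five discrete weights from the word-category frequency."""
    return BUCKET_WEIGHTS[bisect.bisect_right(BUCKET_EDGES, pair_frequency)]


def smoothed_rule_prob(lex_count, lex_total, unlex_count, unlex_total):
    """Weighted sum of lexicalized and unlexicalized relative frequencies.

    lex_count / lex_total:     counts of the rule and of the (word, category) pair.
    unlex_count / unlex_total: counts of the rule and of the bare category.
    """
    lam = smoothing_weight(lex_total)
    p_lex = lex_count / lex_total if lex_total else 0.0
    p_unlex = unlex_count / unlex_total
    return lam * p_lex + (1.0 - lam) * p_unlex


# A rare head leans on the unlexicalized estimate; a frequent one does not.
RARE = smoothed_rule_prob(0, 3, 40, 100)
FREQUENT = smoothed_rule_prob(300, 600, 40, 100)
```

A rule never seen with a rare head still receives non-zero probability from the unlexicalized estimate, which addresses problem (i) above.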

For the lexical choice distributions, an absolute
discounting scheme from Ney, Essen & Kneser [1994] is used,
which is similar to Good-Turing, but somewhat simpler to
work with.
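A minimal sketch of absolute discounting for one lexical choice context is given below. It assumes the interpolated form of the scheme, a fixed discount of 0.5, and a back-off distribution in which the parent head has been dropped; none of these details are specified in the text, and the counts and words are hypothetical.

```python
def absolute_discount(counts, backoff, d=0.5):
    """Discount every seen count by d and give the freed mass to the back-off.

    counts:  dict child head -> count in one context (parent head, parent
             category, child category); keys must also occur in backoff.
    backoff: dict child head -> probability in the less specific distribution.
    d:       hypothetical discount, 0 < d < 1.
    """
    total = sum(counts.values())
    if total == 0:
        return dict(backoff)  # nothing observed: use the back-off as is
    seen = sum(1 for c in counts.values() if c > 0)
    reserved = d * seen / total  # probability mass removed from seen words
    return {
        v: max(counts.get(v, 0) - d, 0.0) / total + reserved * backoff[v]
        for v in backoff
    }


# Hypothetical object heads for one verb, with a hypothetical back-off.
COUNTS = {"issue": 3, "problem": 1}
BACKOFF = {"issue": 0.5, "problem": 0.3, "question": 0.2}
SMOOTHED = absolute_discount(COUNTS, BACKOFF)
```

The unseen word question receives probability 0.05 here, and the smoothed values still sum to one.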

The experiment 

We estimated a head-lexicalized PCFG from parts 
of the British National Corpus (BNC Consortium 
[1995]), using the grammar described in the first 
section and the estimation method of the previ- 
ous section. A bootstrapping method was used, 
in which first a non-lexicalized probabilistic model 
was used to collect lexicalized event counts. On 
the next iteration, counts were estimated based on 
a lexicalized weighting of parses, as described in 
the previous section. 

Analyses were restricted to those consistent with the part
of speech tags specified in the BNC, which are produced
with a tagger. In each lexicalized iteration, event counts
were collected over a contiguous five million word segment
of the corpus. Parameters were re-computed in the way
described above, and the procedure was iterated on the next
contiguous five million word segment. Results from all
iterations were pooled to form a single model estimated
from 50M words. Table 1 illustrates lexical distributions
in this model.

Table 1: On the left: the eight largest parameters in the
lexical choice distribution describing modifying adjectives
selected by satisfactory. On the right: parallel
information for the distribution describing heads of
objects of the verb address.

This training scheme gives the frame distributions for
high-frequency words a chance to converge on their true
distributions, whereas a single 50M word iteration would
not. The strategy derives from a variant generalized EM
algorithm presented in Neal & Hinton [1998]. In a nutshell,
re-estimating the parameters during the course of 
a single training iteration will still lead to conver- 
gence on a maximum-likelihood estimate, provided 
certain conditions are met. Foremost among these 
is the requirement that no parameter setting can 
be prematurely set to zero; this is met by our 
smoothing strategy. This is not to say that pre- 
cisely the same strategy, pursued across multiple 
iterations, would produce a maximum-likelihood 
estimate; it would not. However, "classical" EM, 
requiring repeated iteration over the entire train- 
ing set, is both relatively inefficient and infeasible 
given our present computational resources. 

Dictionary Evaluation 

The dictionary-based evaluation we use was introduced by
Brent [1993] and subsequently used by Manning [1993], Ersan
& Charniak [1995] and Briscoe & Carroll [1996]. The measure
uses precision and recall to compare the set of induced
frames to those in the standard. Precision is
the percentage of frames that the system proposes 
that are correct (i.e. in the standard). Recall is 
the percentage of frames in the standard that the 
system proposes. If the results are broken down 
into true positives (TP), false positives (FP), true 
negatives (TN), and false negatives (FN), preci- 
sion is defined as TP/ (TP + FP) and recall is 
TP/ (TP + FN). To produce measurements from 
our system, we must first reduce our distributions 
to set membership. Brent proposed a stochastic filter for
this reduction, consisting of a set of per-frame
probability cutoffs, which are applied independently of the
lexical head. Although the independence assumption is
certainly dubious, we have adopted this method, without
change, except for the introduction of a heuristic for
finding the frame cutoffs.

The key property of cutoffs is that they control 
the tradeoff of precision versus recall. Raising the 
cutoff will generally produce a higher precision, 
but lower recall, and contrariwise. As we are neu- 
tral about this tradeoff, we set the cutoffs at the 
crossover point, where the difference in precision 
and recall changes sign. This is not entirely de- 
terministic, as the measures may cross more than 
once; in that case, we optimize for the best preci- 
sion. 
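The cutoff heuristic for a single frame can be sketched as below. The sketch assumes a fixed grid of candidate cutoffs, treats a cutoff as a crossover when the sign of precision minus recall differs from that at the previous grid point, and breaks ties between crossovers by precision; the verbs, probabilities, and grid are hypothetical.

```python
def precision_recall(proposed, standard):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) for two sets of verbs."""
    tp = len(proposed & standard)
    fp = len(proposed - standard)
    fn = len(standard - proposed)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def crossover_cutoff(frame_probs, standard, cutoffs):
    """Find a cutoff where precision minus recall changes sign.

    frame_probs: dict verb -> estimated probability of one frame.
    standard:    set of verbs the dictionary lists with that frame.
    Returns the crossover with the best precision, or None if there is none.
    """
    candidates = []
    previous_sign = None
    for cutoff in sorted(cutoffs):
        proposed = {v for v, p in frame_probs.items() if p >= cutoff}
        precision, recall = precision_recall(proposed, standard)
        sign = precision - recall >= 0
        if previous_sign is not None and sign != previous_sign:
            candidates.append((precision, cutoff))
        previous_sign = sign
    return max(candidates)[1] if candidates else None


# Hypothetical probabilities of one frame for five verbs.
FRAME_PROBS = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.2, "e": 0.05}
STANDARD = {"a", "b", "d"}
CUTOFF = crossover_cutoff(FRAME_PROBS, STANDARD, [0.01, 0.1, 0.3, 0.5, 0.7])
```

On this toy input, precision and recall meet at the cutoff 0.3, where both equal 2/3.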

For our dictionary, we used The Oxford Advanced 
Learner's Dictionary (Hornby [1985]), also used by 
Ersan/Charniak and Manning. We reduced our 
frame set and the dictionary's to a common set, 
mapping some frames and eliminating others. For 
evaluation, we selected 200 verbs at random from 
among those that occurred more than 500 times in 
the training data; half were used to set the optimal 
cutoff parameters, and precision and recall were 
measured with the remainder. 

Table 2 shows results broken down by frame. The largest
source of error is the intransitive frame. It is not hard
to understand why: our robust parsing architecture resolves
unparsable constructs as intransitives. In addition to
sentences where verbs are not linked up with their
complements because of interjections, complex conjunctions
or ellipses, this includes frames such as SBAR and
WH-complements which are not included in the chunk/phrase
grammar. While it would be possible in principle to extract
these from the present word collocation statistics, we plan
instead to pursue a solution involving extensions in the
grammar.

Table 2: Precision/recall broken down by frame.

A second major source of error is prepositional phrases.
The complementation model embodied in the PCFG does not
distinguish complements from adjuncts, and therefore
adjunct prepositional phrases are a source of false
positives. Thus the NP PP frame is scored as a false
positive for the verb meet, because the OALD does not list
the frame, although the combination appears often in the
corpus data. While such frames lead to a loss of precision
in the dictionary evaluation, we do not necessarily
consider them a flaw in the information learned by the
system, since the argument/adjunct distinction is often
tenuous, and adjuncts are in many cases lexically
conditioned.

Lastly, there are many false negatives for the 
particle frame and noun plus particle. This is 
mainly due to disagreements between BNC par- 
ticle tagging and particle markup in the OALD. 

Despite these difficulties, the summary in Table 3 shows
results that are on the whole favorable. In comparison with
other work with a comparable number of frames (Manning,
Ersan/Charniak), the system is well ahead on recall and
well behind on precision. If one takes the sum of precision
and recall to be the final performance indicator, then we
are slightly ahead: 1.54 vs. 1.44 for Ersan and 1.33 for
Manning. Briscoe and Carroll's work, with ten times as many
target frames, is so different that the numbers may be
regarded as incomparable.

Table 3: Type precision/recall comparison. Some of
Manning's frames are parameterized for a preposition.

Obviously, precision and recall measured against a standard
rely on the completeness and accuracy of that standard. In
checking false positives, Ersan and Charniak found that the
OALD was incomplete enough to have a serious impact on
precision. Symmetrically, false negatives conflate
deficiencies in the corpus with poor learning efficiency.
It is impossible to say based on Table 3 which of the
systems is more efficient at learning. While our system
shows the best recall, this could be attributed
to our having the best training data. Charniak 
used 40M words of training data, comparable to 
our 50M, but his data was homogeneous, all taken 
from the Wall Street Journal. As we will show be- 
low, frame usage varies across genres, so the BNC, 
which includes texts from a wide variety of sources, 
shows more varied frame usage than the WSJ, and 
thus provides better data for frame acquisition. 

Cross entropy evaluation 

The information-theoretic notion of cross entropy provides
a detailed measure of the similarity of the acquired
probabilistic lexicon to the distribution of frames
actually exhibited in the corpus (which we call the
empirical distribution). The cross entropy of the estimated
distribution $q$ with the empirical distribution $p$ obeys
the identity

$$CE(p, q) = H(p) + D(p \,\|\, q)$$

where $H$ is the usual entropy function and $D$ is the
relative entropy, or Kullback-Leibler distance.
The entropy of a distribution over frames can be 
conceptualized as the average number of bits re- 
quired to designate a frame in an ideal code based 
on the given distribution. In this context, entropy 
measures the complexity of the observed frame dis- 
tribution. The relative entropy is the penalty paid 
in bits when the frame is chosen according to the 
empirical distribution p, but the code is derived 
from the model's estimated distribution, q. Rel- 
ative entropy is always non-negative, and reaches 
zero only when the two distributions are identical. 
Our goal, then, is to minimize the relative entropy. 
For more in-depth discussion of entropy measures, 
see Cover & Thomas [1991], or any introductory 
information theory text. 
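The identity relating cross entropy, entropy, and relative entropy can be checked numerically. The sketch below uses base-2 logarithms, so all quantities are in bits; the two frame distributions are hypothetical and the frame labels are illustrative only.

```python
from math import log2


def entropy(p):
    """H(p) in bits."""
    return -sum(pi * log2(pi) for pi in p.values() if pi > 0)


def cross_entropy(p, q):
    """CE(p, q) in bits; q must be non-zero wherever p is."""
    return -sum(pi * log2(q[f]) for f, pi in p.items() if pi > 0)


def relative_entropy(p, q):
    """D(p || q) in bits; q must be non-zero wherever p is."""
    return sum(pi * log2(pi / q[f]) for f, pi in p.items() if pi > 0)


# Hypothetical empirical (p) and estimated (q) frame distributions.
P = {"NP": 0.5, "NP PP": 0.25, "INTRANS": 0.25}
Q = {"NP": 0.25, "NP PP": 0.25, "INTRANS": 0.5}

H_P = entropy(P)              # 1.5 bits
D_PQ = relative_entropy(P, Q)  # 0.25 bits
CE_PQ = cross_entropy(P, Q)    # 1.75 bits
```

Coding frames drawn from p with a code built for q costs 1.75 bits on average here, a quarter of a bit more than the 1.5 bits an ideal code for p would need.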

For relative entropy to be finite, the estimated 
distribution q must be non-zero whenever p is. 
However, some observed frames are not present in 
the grammar, for one of two reasons. Some well-known frames
such as SBAR require high-level constructs not available in
the chunk/phrase grammar, and unusual or unorthodox frames
turn up in the data, e.g. PART PP PP. Since the model lacks
these
frames, smoothing against the unlexicalized rules 
is insufficient. Instead, for all the estimated distri- 
butions, we smooth against a Poisson distribution 
over categories, which assigns non-zero probability 
to all frames, observed or not. This allows us to 
spell out the unknown frame using a known finite 
alphabet, the grammar categories, while retaining 
a reasonable average length over frames. 
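One way to realize such a fallback distribution is sketched below: the frame length is drawn from a Poisson distribution and each position is filled uniformly from the grammar categories. The text does not give the exact construction, so the uniform choice of categories, the mean length of 2, and the mixing weight are all assumptions made for the illustration.

```python
from math import exp, factorial


def poisson_frame_prob(frame, num_categories, mean_length=2.0):
    """Probability of a frame spelled out as a sequence of categories.

    The length follows a Poisson distribution with the given (hypothetical)
    mean; each category is then chosen uniformly from num_categories options.
    """
    k = len(frame)
    length_prob = exp(-mean_length) * mean_length ** k / factorial(k)
    return length_prob / num_categories ** k


def smoothed_frame_prob(q, frame, num_categories, eps=0.01):
    """Mix an estimated frame distribution q with the Poisson fallback.

    eps is a hypothetical mixing weight; frames absent from q get q = 0.
    """
    fallback = poisson_frame_prob(frame, num_categories)
    return (1.0 - eps) * q.get(frame, 0.0) + eps * fallback


# Hypothetical estimated distribution that lacks the frame PART PP PP.
Q = {("NP",): 0.6, ("NP", "PP"): 0.4}
UNSEEN = smoothed_frame_prob(Q, ("PART", "PP", "PP"), num_categories=10)
```

Because every category sequence has positive probability under the fallback, the relative entropy against any observed frame distribution stays finite.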

For our entropy measurements, we selected three verbs,
allow, reach, and suffer, and extracted about 200
occurrences of each from portions of the BNC not used for
training. Half of each sample was drawn from "imaginative"
text and the other half from the natural or applied
sciences, as indicated by BNC text mark-up. The true frame
for each verb occurrence was marked by a human judge.^3 The
empirical distribution was taken as the maximum-likelihood
estimate from these frequencies. Tables 4 and 5 indicate
the observed frequencies and the entropy of the resulting
distributions.

^3 For this judgment, the frame set was unrestricted, i.e.
included frames not in the grammar.

Table 4: True and estimated frame frequencies for allow.

Alongside the observed frequencies, we indicate 
a set of estimated frequencies. These were gen- 
erated by taking the 50M word model described 
above, parsing the test sentences, and extracting 
the estimated frequencies. The sum of estimated 
frequencies is generally less than the observed fre- 
quencies due to tagging errors, parse failures, and 
frequency assigned to frames not shown in the ta- 
bles. However, an eyeball inspection of the tables 
shows that the parser does a good job of reproduc- 
ing the target distribution. 

One striking feature in the tables is the variation 
across genre. In particular, suffer used in the imag- 
inative genre shows a very different distribution 
than suffer in the natural sciences. A chi-squared 
test applied to each pair indicates that the sam- 
ples come from distinct distributions (confidence 
> 95%). 
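The kind of test involved can be illustrated with a hand-computed Pearson chi-squared statistic for two rows of frame counts. The counts below are hypothetical and are not the frequencies from the tables; with three frames there are two degrees of freedom, for which the 5% critical value is 5.991.

```python
def chi_squared_statistic(row_a, row_b):
    """Pearson chi-squared statistic for a 2 x k table of frame counts."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        # Expected counts if both genres shared one frame distribution.
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (a - expected_a) ** 2 / expected_a
        statistic += (b - expected_b) ** 2 / expected_b
    return statistic


# Hypothetical counts of three frames for one verb in two genres.
IMAGINATIVE = [30, 50, 20]
SCIENCE = [60, 25, 15]
CRITICAL_VALUE_DF2 = 5.991  # chi-squared, 2 degrees of freedom, 5% level

STATISTIC = chi_squared_statistic(IMAGINATIVE, SCIENCE)
DISTINCT = STATISTIC > CRITICAL_VALUE_DF2
```

For these made-up counts the statistic is about 19.05, well above the critical value, so the hypothesis of a shared distribution would be rejected at the 95% level.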

The column labeled "50M lex" in Table 6 provides a
quantitative measure of the agreement between the 50M word
combined model and the empirical distributions for the
three verbs in two genres in the form of relative entropy.
The first column repeats the entropy of the data
distributions. For purposes of comparison, the second
column indicates the relative entropy of one data
distribution with the other data distribution filling the
role of the estimated distribution (i.e. $q$) in the
discussion above. The relative entropy is lower when the
estimated distribution is used for $q$ than when the data
distribution for the other genre is used for $q$ in each
case but one, where the figures are the same. This suggests
the combined model contains fairly good overall
distributions.

Table 5: True and estimated frame frequencies for reach
(top) and suffer (bottom).

To numerically evaluate whether the system was able to
learn the distribution exhibited in a given collection of
sentences, we tuned the lexicon by parsing the test
sentences for each genre separately with the 50M word
model, extracting the frequencies, and estimating the
distribution from these. The results are in the column
labeled "50M lexicalized extraction" in Table 7. The
following columns give the same figures for frequency
extraction with other models. Extraction with the large
lexicalized model gives the best results, and gives better
relative entropy than the 50M lexicalized model itself (in
column 2). Notice that only the distributions estimated
with the two 50M models are better than the 50M lexicalized
model, though the unlexicalized one is only marginally
better. In this sense, only the 50M lexicalized parser
proves to be a good enough parser for genre tuning. Notice
that with this model, tuning in no case gives worse
relative entropy, and in five out of six cases gives an
improvement.

Table 6: Frame relative entropy for three verbs in two
genres. The first column names the lexical head and genre,
and the second the entropy ($H$) of the empirical
distribution over frames, $p$. By empirical distribution we
mean the relative frequencies from examples scored by a
human judge. Columns three through five give the relative
entropy $D(p \,\|\, q)$ for various related distributions.
In column three, $q$ is the empirical frame distribution
for the same head, but with the complementary genre. In
column four, $q$ is the (genre-independent) distribution
derived from the 50M word lexicalized model. Column five
uses the unlexicalized frame distribution derived from the
50M model, i.e. a distribution insensitive to the head
verb. Lower relative entropy is better.

Table 7: Relative entropy of distributions estimated by
parsing the test sentences with various models, and using
the inside-outside algorithm to produce estimated
distributions $q$. The first column names empirical
distributions $p$. The second column repeats relative
entropy for the 50M lexicalized model from the previous
table. The third gives relative entropy where $q$ is
obtained by parsing and estimating frequencies in the test
sentences with the 50M lexicalized model. The following
columns give the corresponding figures for a $q$ obtained
by following the same procedure with a 5M word lexicalized
model, a 50M word unlexicalized model, and a 5M word
unlexicalized model.

Notice also that relative entropy for the distributions
obtained by tuning with the 50M model is a good deal lower
than the cross-genre figures from Table 6. This suggests
that if we wanted to have
a good probabilistic lexicon for, say, the imagina- 
tive genre, we would be better off using the au- 
tomatic extraction procedure on data drawn from 
that genre than using a perfect parser (or a lexi- 
cographer) on data drawn from some other genre, 
such as the natural sciences. This provides a cal- 
ibration of the accuracy of the lexicalized parser's 
estimates, and conversely demonstrates that words 
are not used in the same way in different genres. 



Optimal parses 

Although identifying a unique parse does not play 
a role in our experiment, it is potentially useful 
for applications. A simple criterion is to pick a 
parse with maximal probability; this is identified 
in a parse forest by iterating from terminal nodes, 
multiplying child probabilities and the local node 
weight at and-nodes (chart edges), and choosing a 
child with maximal probability at or-nodes (chart 
constituents). Figures 4 and 5 give examples of
maximal-probability parses.

Other optimality criteria can be defined. The structure on
noun chunks is often highly ambiguous, because of
bracketing and part of speech ambiguities among modifiers.
For many purposes, the internal structure of a noun chunk
is irrelevant; one just wants to identify the chunk. From
this point of view, a probability estimate which considers
just one analysis might underestimate the probability of a
noun chunk. In what we call a
sum-max parse, probabilities are summed within 
chunks by the inside algorithm. Above the chunk 
level, a highest-probability tree is computed, as de- 
scribed above. 
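The two criteria can be contrasted on a toy and-or forest. In the sketch below, each or-node maps to a list of alternatives, and each alternative (and-node) is a pair of a local weight and a list of child or-nodes. The forest, its weights, and the choice of NC as the only chunk node are hypothetical, and the functions return scores only, not the trees themselves.

```python
from math import prod


def score(forest, node, combine, memo):
    """Combine the alternatives of an or-node with max or sum, memoized.

    Each alternative contributes its local weight times the scores of its
    children, as at an and-node of the parse forest.
    """
    if node not in memo:
        memo[node] = combine(
            weight * prod(score(forest, child, combine, memo) for child in children)
            for weight, children in forest[node]
        )
    return memo[node]


def max_probability(forest, root):
    """Probability of the single best tree."""
    return score(forest, root, max, {})


def inside_probability(forest, root):
    """Total probability of all trees below the root."""
    return score(forest, root, sum, {})


def sum_max(forest, root, chunk_nodes):
    """Sum within chunks, then maximize above the chunk level."""
    memo = {chunk: inside_probability(forest, chunk) for chunk in chunk_nodes}
    return score(forest, root, max, memo)


# Hypothetical forest: the noun chunk NC and the VP each have two analyses.
FOREST = {
    "S": [(0.5, ["NC", "VP"])],
    "NC": [(0.3, []), (0.2, [])],
    "VP": [(0.4, []), (0.1, [])],
}

BEST = max_probability(FOREST, "S")    # 0.5 * 0.3 * 0.4
SUM_MAX = sum_max(FOREST, "S", {"NC"})  # 0.5 * (0.3 + 0.2) * 0.4
INSIDE = inside_probability(FOREST, "S")
```

Summing inside the chunk raises the score of the analysis from 0.06 to 0.1, which reflects the point that a single internal analysis underestimates the probability of the chunk.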

Notes on the implementation and 
parsing times 

Software is implemented in C++. The parser used
for the bootstrap phase is a vanilla CFG chart 
parser, operating bottom-up with top-down pre- 
dictive filtering. Chart entries are assigned proba- 
bilities using the unlexicalized PCFG, and the lex- 
icalized frequencies are found by carrying out a 
modified inside-outside algorithm which simulates 
lexicalization of the chart. 

In the iterative training phase, an unlexical- 
ized context-free skeleton is found with the same 
parser. We transform this into its lexicalized 
form — categories become {w, n) pairs and rules 
acquire lexical heads — and carry out the stan- 
dard inside-outside using the more elaborate head- 
lexicalized PCFG model. Average speed of the 
parser during iterative training, including pars- 
ing, probability calculation, and recording observa- 
tions, is 10.4 words per second on a Sun Sparc-20. 
The memory requirements for a model generated from a 5M
word segment are about 90 Mbyte. The upshot of all this is
that we can train about 1M words per day on one machine,
and a single 5M word iteration requires one machine work
week.
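These figures follow from the reported parsing speed, assuming the machine runs continuously for 24 hours a day (an assumption made here; the text does not state the duty cycle):

```latex
\begin{align*}
10.4 \ \text{words/s} \times 86{,}400 \ \text{s/day}
  &\approx 8.99 \times 10^{5} \ \text{words/day} \approx 1\text{M words/day}, \\
\frac{5 \times 10^{6} \ \text{words}}{8.99 \times 10^{5} \ \text{words/day}}
  &\approx 5.6 \ \text{days}.
\end{align*}
```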

Discussion 

We believe the formalism and methodology de- 
scribed here have the following advantages: 

• The grammar is under the control of the compu- 
tational linguist and is of a familiar kind, mak- 
ing it possible to incorporate standard linguis- 
tic analyses, and making results interpretable 
in terms of linguistic theory. In contrast, ap- 
proaches where context free rules are learned are 
likely to produce structures which are uninter- 
pretable in terms of linguistic theory and prac- 
tice. 

• Because of the context free framework, efficient
parsing algorithms (chart parsing) and probabilistic
algorithms (the inside-outside algorithm)
can be applied. With an efficient implementa- 
tion, this makes it possible to construct repre- 
sentations of all the tree analyses for the sen- 
tences in corpora on the scale of ten to a hun- 
dred million words, and to map such a corpus to 
a probabilistic lexicon. 

• With the robustness introduced by the state 
model, almost all sentences in the corpus can 
be parsed. 

• The model assigns probabilities to sentences and 
trees, which is useful for applications indepen- 
dent of the lexicon-induction problem discussed 
here. 

• The word-selection model, which threads a word 
bigram model through head relations in the syn- 
tactic tree, allows a large body of word-word col- 
locations to be learned from the corpus, and put 
to use in weighting of competing analyses. 

• The valence information learned, rather than being
simply a set of subcategorization frames, is a
probability distribution which reflects the frequency
of frames in a given training sample, and which can be
plugged back into the parser and used to analyze
further text.

Figure 4: The first part of maximum probability parse.

Figure 5: The second part.



Some of these benefits are purchased at the cost 
of a lack of sophistication in the grammar formal- 
ism, compared to constraint-based formalisms used 
in contemporary computational linguistics. This 
compromise is made in order to make large-scale 
experiments achievable; our interest is in conduct- 
ing scientific experiments — observational and mod- 
eling experiments — with large bodies of language 
use. It is natural that this should require incor- 
porating approximations in computational mod- 
els. Notably, the compromises made in our ap- 
proach are not so severe that the grammatical 
analyses identified and the probability parame- 
ters learned are out of touch with linguistic real- 
ity. This is in contrast to the situation with other 
approaches using similar mathematical methods, 
such as terminal-string n-gram modeling. 

Conclusion 

We have presented a statistically-based method for valence
induction, based on the idea of automatic tuning of the
probability parameters of a grammar. On the standard
precision/recall measures, our system performs better on
recall, worse on precision, and on the whole somewhat
better than other published systems. We have provided a
more precise evaluation via entropy measures, showing that
the model learns efficiently and builds accurate models of
frame distributions. The cross-domain entropy of the data
frame distributions provides numerical evidence that frame
usage varies across domains, similar to word usage. This,
in turn, suggests that automatic acquisition and stochastic
tuning are a must for large-scale NLP applications and
computational linguistic models.